This analysis demonstrates how to create an interactive heatmap of
correlations to visualize relationships between numerical variables. The
goal is to explore correlation coefficients and observe how they change
when the input data is modified. We will use the diamonds
dataset from the ggplot2 package and visualize
relationships using plotly for interactivity.
In this step, we load the necessary libraries and the
diamonds dataset. Correlation calculations require numerical
data, so only the numerical variables are kept in the next
step.
# Load the ggplot2 package to access the diamonds dataset
library(ggplot2)
# Load the plotly package for interactive visualizations
library(plotly)
# Load the diamonds dataset
data(diamonds)
# View the first few rows of the dataset
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
The diamonds dataset contains attributes like
carat, price, and depth, which
are suitable for correlation analysis. The head() function
provides a quick preview of the dataset.
We identify and select only the numerical variables from the dataset to compute correlations. This ensures the analysis focuses solely on numerical relationships.
# Identify numerical columns in the dataset
numeric_columns <- sapply(diamonds, is.numeric)
# Create a new dataframe with only numerical variables
diamonds_numeric <- diamonds[, numeric_columns]
The sapply() function checks each column to determine if
it is numeric. We use this information to subset the dataset and create
a new dataframe diamonds_numeric containing only numerical
variables.
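As a small illustration of the same column-selection pattern, the sketch below applies it to a toy data frame. The data frame `toy` and its values are made up for this example. It uses vapply(), a stricter relative of sapply() that fixes the return type, and adds drop = FALSE so the result stays a data frame even if only one column turns out to be numeric.

```r
# Toy data frame (hypothetical, for illustration only)
toy <- data.frame(
  carat = c(0.23, 0.21, 0.29),
  cut   = factor(c("Ideal", "Premium", "Good")),
  price = c(326L, 326L, 334L)
)

# vapply() works like sapply() but guarantees a logical vector is returned
is_num <- vapply(toy, is.numeric, logical(1))

# drop = FALSE keeps a data frame even when a single column is selected
toy_numeric <- toy[, is_num, drop = FALSE]

# The factor column is excluded; double and integer columns are kept
names(toy_numeric)
```

Factor columns (including ordered factors such as cut, color, and clarity in diamonds) are not numeric in the sense of is.numeric(), so they are left out of the subset.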
Next, we calculate the correlation matrix for the selected numerical variables. The correlation matrix quantifies the pairwise relationships between variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation).
# Compute the correlation matrix
correlation_matrix <- cor(diamonds_numeric, use = "complete.obs")
The cor() function computes pairwise correlations, using the
Pearson coefficient by default. Setting use = "complete.obs"
excludes every row that contains a missing value before the
correlations are computed.
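To make the effect of the use argument concrete, the following sketch compares the default behaviour of cor() with use = "complete.obs" on a toy data frame containing one missing value. The data frame `toy` is an assumption made for this example and is not part of the analysis.

```r
# Toy data with one missing value (hypothetical, for illustration only)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, NA, 8, 10),
  c = c(5, 3, 4, 1, 2)
)

# Default (use = "everything"): pairs involving a column with NA yield NA
cor_default <- cor(toy)

# "complete.obs": drop every row that contains an NA, then correlate
cor_complete <- cor(toy, use = "complete.obs")

# Same result as removing the incomplete rows by hand first
cor_manual <- cor(na.omit(toy))

is.na(cor_default["a", "b"])
all.equal(cor_complete, cor_manual)
```

Note that "complete.obs" drops a row for all variable pairs, even pairs that do not involve the missing value; use = "pairwise.complete.obs" is the alternative when that is too strict.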
To visualize the correlation matrix interactively, we define a
function that generates a heatmap using plotly. This
heatmap allows users to explore correlations interactively by hovering
over the cells.
# Create a simple function to generate an interactive heatmap with numbers
create_correlation_heatmap <- function(correlation_matrix) {
  plot_ly(
    x = colnames(correlation_matrix),    # Variable names for the x-axis
    y = rownames(correlation_matrix),    # Variable names for the y-axis
    z = correlation_matrix,              # Correlation coefficients as z-values
    type = "heatmap",                    # Specify that we want a heatmap
    colorscale = "RdBu",                 # Use a red-blue diverging color scale
    reversescale = TRUE,                 # Reverse the direction of the color scale
    zmin = -1, zmax = 1,                 # Fix the color range so that 0 sits at the midpoint
    text = round(correlation_matrix, 2), # Add rounded correlation values as text
    texttemplate = "%{text}"             # Display numbers on the heatmap
  ) %>%
    layout(
      title = "Interactive Correlation Heatmap",
      xaxis = list(title = "Variables"),
      yaxis = list(title = "Variables")
    )
}
# Call the function to display the heatmap
create_correlation_heatmap(correlation_matrix)
The function takes a correlation matrix as input and:
1. Maps variables to the x and y axes.
2. Uses a diverging color scale (RdBu) so that positive and negative correlations appear in contrasting colors.
3. Displays rounded correlation values on the heatmap.
4. Creates an interactive visualization where users can hover over cells to view details.
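Alongside the heatmap, it can help to list the same correlations as a ranked table. The sketch below is one possible way to do this in base R; the helper name `rank_correlations` is hypothetical, and the built-in mtcars dataset is used only so the example runs without extra packages.

```r
# Helper (hypothetical name) that lists variable pairs by |correlation|
rank_correlations <- function(correlation_matrix) {
  # Keep each pair once by taking only the upper triangle
  idx <- which(upper.tri(correlation_matrix), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(correlation_matrix)[idx[, "row"]],
    var2 = colnames(correlation_matrix)[idx[, "col"]],
    r    = correlation_matrix[idx]
  )
  # Strongest relationships first, regardless of sign
  pairs[order(-abs(pairs$r)), ]
}

# Usage on a built-in dataset
m <- cor(mtcars[, c("mpg", "wt", "hp")])
ranked <- rank_correlations(m)
ranked
```

The same helper should accept the correlation_matrix computed earlier, since it only relies on the matrix having row and column names.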
To analyze how correlations change with modified input data, we
create a filtered version of the dataset that includes only diamonds
with carat values less than 2. This focuses on a subset of
diamonds and recalculates the correlation matrix.
# Filter the dataset to include only diamonds with carat < 2
# (filter() is dplyr's verb and %>% is magrittr's pipe; plotly re-exports both)
diamonds_filtered <- diamonds_numeric %>% filter(carat < 2)
# Compute the correlation matrix for the filtered dataset
correlation_matrix_filtered <- cor(diamonds_filtered, use = "complete.obs")
# Generate a heatmap for the filtered dataset
create_correlation_heatmap(correlation_matrix_filtered)
Here:
1. The filter() function removes rows where carat is 2 or more.
2. A new correlation matrix is calculated using the filtered data.
3. A heatmap is generated for the modified dataset, allowing for comparison with the original heatmap.
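The comparison between the two matrices can also be made numerically by subtracting one from the other. The sketch below does this on simulated data rather than on diamonds; the variables `x` and `y`, the sample size, and the cutoff are all assumptions chosen for illustration. It also shows a general effect of this kind of filter: restricting the range of one variable tends to lower its correlation with variables that depend on it.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated data (hypothetical): y depends linearly on x plus noise
n <- 5000
x <- runif(n, min = 0, max = 5)
sim <- data.frame(x = x, y = 2 * x + rnorm(n))

# Correlation matrices for the full data and for a restricted range of x
cor_full     <- cor(sim)
cor_filtered <- cor(sim[sim$x < 2, ])

# Element-wise change in correlation caused by the filter
cor_change <- cor_filtered - cor_full
round(cor_change, 2)
```

Because the difference has the same shape and names as a correlation matrix, it could in principle be passed to create_correlation_heatmap() to highlight which pairs changed most.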
Restricting the dataset to diamonds with carat below 2 may reduce the variability
in related variables, potentially weakening their correlations.
Interactive heatmaps are a powerful tool for exploring and visualizing correlations between numerical variables. They allow users to engage with the data dynamically and observe how changes in input data affect relationships. This approach makes complex data more accessible and interpretable, particularly for exploratory data analysis.